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We discuss how the steepest descent method with Fourier acceleration for Laudau gauge fixing in 
lattice SU(3) simulations can be implemented using CUDA. 

The scaling of the gauge fixing code was investigated using a Tesla C2070 Fermi architecture, 
and compared with a parallel CPU gauge fixing code. 
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1. Introduction and motivation 



On the lattice, Landau gauge fixing is performed by maximising the functional 

1 

4KV 



F ^] = z^EE Re [ Tr («W^W« t ( jc +M))] , (i.i) 

x 11 



where N c is the dimension of the gauge group and V the lattice volume, on each gauge orbit. 
One can prove that picking a maximum of F\j [g] on a gauge orbit is equivalent to demand the usual 
continuum Landau gauge condition d^A"^ = and to require the positiveness of the Faddeev-Popov 
determinant. In the literature this gauge is known as the minimal Landau gauge. The functional 
Fu[g] can be maximised using the steepest descent method [1, 2]. However, when the method is 
applied to large lattices, it faces the problem of critical slowing down, which can be attenuated by 
Fourier acceleration. 

In the Fourier accelerated method, at each iteration one chooses 



g[x) = exp 

with 



2 p 2 a 2 



F (l>-v [Uv(x)-U$(x)] -trace^) 



(1.2) 



A- v (U fl (x))=U tl (x-av)-U tl (x), (1.3) 

p 1 are the eigenvalues of (— d 2 ), a is the lattice spacing and F represents a fast Fourier transform 
(FFT). For the parameter a, we use the recommended value 0.08 [1]. In what concerns the com- 
putation of g(x), for numerical puiposes it is enough to expand the exponential to first order in a, 
followed by a reunitarization, i.e., a projection to the SU(3) group. The evolution and convergence 
of the gauge fixing process can be monitored by 

= ^E Tr [ A W At w] (i- 4 ) 

where 

A(x) =^[U v (x -av)-U v (x) -h.c- trace] (1.5) 

V 

is the lattice version of d^A^ = and 6 is the mean value of d^A^ evaluated over all space-time 
lattice points per color degree of freedom. In all the results shown below, gauge fixing was stopped 
only when 6 < 10~ 15 . 



2. Parallel implementation of the gauge fixing algorithm 

2.1 CPU implementation 

The MPI parallel version of the algorithm was implemented in C++, using the machinery 
provided by the Chroma library [3]. The Chroma library is built on top of QDP++, a library which 
provides a data-parallel programming environment suitable for Lattice QCD. The use of QDP++ 
allows the same piece of code to run both in serial and parallel modes. For the Fourier transforms, 
the code uses PFFT, a parallel FFT library written by Michael Pippig [4]. Note that, in order to 
optimize the interface with the PFFT routines, we have compiled QDP++ and Chroma using a 
lexicographic layout. 
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Algorithm 1 Landau gauge fixing using FFTs. 

1: calculate A(jc), F g [U] and 6 

2: while 6 > e do 

3: for all element of A(x) matrix do 

4: apply FFT forward 

5: apply pi ax /p 2 

6: apply FFT backward 

7: normalize 

8: end for 

9: for all x do 

10: obtain g(x) and reunitarize 

ll: end for 

12: for all x do 

13: for all /I do 

14: £/ M (x)^g(*)t/ M (*)^(x+A) 

15: end for 

16: end for 

17: calculate A(jc), F g [U] and 

18: end while 



2.2 GPU implementation 

For the parallel implementation of the SU(3) Landau gauge fixing on GPU's [5], we used 
version 4. 1 of CUDA. For the 4D lattice, we address one thread per lattice site. Although CUDA 
supports up to 3D thread blocks [6] , the same does not happen for the grid, which can be up to 2D 
or 3D depending on the architecture and CUDA version. For grids up to 3D, this support happens 
only for CUDA version 4.x and for a CUDA device compute capability bigger than 1.3, i.e. at this 
moment only for the Fermi and Kepler architectures. Nevertheless, the code is implemented with 
3D thread blocks and for the grid we adapted the code for each situation. Since our problem needs 
four indexes, using 3D thread blocks (one for t, one for z and one for both x and y), we only need 
to reconstruct the other lattice index inside the kernel. 

We use the GPU constant memory to put most of the constants needed by the GPU, like the 
number of points in the lattice, using cudaMemcpyToSymbol(). To store the lattice array in global 
memory, we use a SOA type array as described in [7]. The main reason to do this is due to the FFT 
implementation algorithm. The FFT is applied for all elements of A(x) matrix separately. Using the 
SOA type array, the FFT can be applied directly to the elements without the necessity of copying 
data or data reordering. 

On GPU's, FFT are performed using the CUFFT library by NVIDIA. Since there is no support 
for 4D FFTs, one has to combine four ID FFTs, two 2D FFTs or 3D+1D FFTs. The best choice 
for our 4D problem is two 2D FFTs. 

In order to reduce memory traffic we can use the unitarity of SU(3) matrices and store only the 
first two rows (twelve real numbers) and reconstruct the third row on the fly when needed, instead 
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of storing it. 

To perform Landau gauge fixing in GPUs using CUDA, we developed the following kernels: 

• kl: kernel to obtain an array with p 2 msx j p . 

• k2: kernel to calculate A(x), F g [U] and 6. The sum of F g [U] and 6 over all the lattice sites are 
done with the parallel reduction code in the NVIDIA GPU Computing SDK package, which 
is already an optimized code. 

• k3: kernel to perform a data ordering. 

• k4: apply p^ax/ P 2 an d normalize. 

• k5: obtain g(x) and reunitarize. 

• k6: perform Up (x) — > g(x)U^(x)g^(x + p.). 

In Table 1 , we show the number of floating-point operations and the number of memory load- 
s/stores per thread by kernel. The number of floating-point operations using 2D plus 2D FFTs is 
given by nxxny xnzxnt x 5(log 2 (nx x ny) + log 2 («z xnt)). 

3. Results 

In this section we show the performance of the GPU code, and compare with the parallel CPU 
code. The GPU performance results were performed on a NVIDIA Tesla C2070, Table 2, and 
using version 4.1 of CUDA. The parallel CPU performance were done in the Centaurus cluster, 
at Coimbra. This cluster has 8 cores per node, each node has 24 GB of RAM, with 2 intel Xeon 
E5620@2.4 GHz (quad core) and it is equipped with a DDR Infiniband network. 

In Fig. 1, we show the GPU performance, in GFlops, of the algorithm using a 12 parameter 
reconstruction and the full (18 number) representation in single and double precision. The GPU 
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153 
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162/72 
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108/48 
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Table 1 : Kernel memory loads/stores and number of floating-point operations (flop) per thread by 
kernel. The total number of threads is equal to the lattice volume. For kernel details see the text. 
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NVIDIA Tesla C2070 



Number of GPUs 1 


Global memory 6144 MB 


CUDA Capability 2.0 


Memory bandwidth (ECC off) 148 GBytes/s 


Multiprocessors (MP) 14 


Shared memory (per SM) 48KB or 16KB 


Cores per MP 32 


LI cache (per SM) 16KB or 48KB 


Total number of cores 448 


L2 cache (chip wide) 768KB 


Clock rate 1.15 GHz 


Device with ECC support yes 



Table 2: NVIDIA's graphics card specifications used in this work. 
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Figure 1: Performance in GFlops, [5]. (a) with ECC Off and (b) with ECC On. sp: single precision, 
dp: double precision, 18real: full SU(3) matrix, 12real: 12 real parametrization, tex: using texture 
memory and cache: using LI and L2 cache memory. 



5 



GPU implementation of a Landau gauge fixing algorithm 



Paulo J. Silva 



memory access is also compared, using LI and L2 caches and the texture memory. The best perfor- 
mance is achieved with ECC off, using texture memory and the 12 real number parameterization 
of the SU(3) matrix. 

In order to compare the performance of the two codes, we use a 32 4 lattice volume. The 
configurations have been generated using the standard Wilson gauge action, with three different 
values of j8, 5.8, 6.0 and 6.2. In all runs, we set a = 0.08 and d < 10~ 15 . 

The CPU performance is compared with the best GPU performance, using 12 real parameter- 
ization, texture memory and ECC off, for a 32 4 lattice volume in double precision. The parallel 
CPU performance shows a good strong scaling behavior, Fig. 2, with a linear speed-up against the 
number of computing nodes. Nevertheless, the GPU code was much faster than the parallel CPU 
one: in order to reproduce the performance of the GPU code, one needs 256 CPU cores. 




Figure 2: Strong scaling CPU tested for a 32 4 lattice volume and comparison with the GPU for the 
best performance, 12 real number parametrization, ECC Off and using texture memory in double 
precision, [5]. In Centaurus, a cluster node means 8 computing cores. 



4. Conclusion 

We have compared the performance of the GPU implementation of the Fourier accelerated 
steepest descent Landau gauge fixing algorithm using CUDA with a standard MPI implementa- 
tion built on the Chroma library. The run tests were done using 32 4 /3 = 5.8,6.0,6.2 pure gauge 
configurations, generated using the standard Wilson action. 

In order to optimize the GPU code, its performance was investigated on 32 3 x n gauge config- 
urations. The runs on a C2070 Tesla show that, for a 4D lattice, the best performance was achieved 
with 2D plus 2D FFTs using cufftPlanMany() and using a 12 real parameter reconstruction with 
texture memory. 
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From all the runs using a C2070 Tesla GPU, peak performance was measured as 186/71 
GFlops for single/double precision. From the performance point of view, a run on a single GPU 
delivers the same performance as the CPU code when running on 32 nodes (256 cores), if one 
assumes a linear speed-up behavior, in double precision. 
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